Lexicographic Potential of the Georgian Dialect Corpus
Abstract
The project Linguistic Portrait of Georgia aims to document various aspects of Georgian linguistic reality using corpus methodologies. The title is an umbrella for three large-scale projects, within the framework of which the Georgian Dialect Corpus (GDC, http://corpora.co) was developed. The architecture and text base of the corpus have been designed and are continually developed and updated. In addition, the lexicographic base of the corpus has been organized, aggregating data from printed dialect dictionaries. The lexical stock of the corpus is presented on the basis of textual, lexicographic and encyclopaedic data. The corpus is estimated to contain up to 2 000 000 tokens, while the lexicographic base currently holds 60 000 items (lemmas with entries); this number increases considerably once the phonetic and grammatical variants frequently associated with a single lexical item are counted.
Similar resources
Globalization, Standardization, and Dialect Leveling in Iran
This paper attempts to shed light on the effects of modernization, urbanization, a monolingual educational system, and mass media, as well as the process of globalization, on dialect leveling among Persian dialects. To this end, the first part of the paper elaborates on the relationship between globalization and sociolinguistics, and on the concept of standardization. Also, it discusses some ...
A Description of Derivational Affixes in Sarhaddi Balochi of Granchin
The Sarhaddi Balochi dialect, a variety of Western (Rakhshani) Balochi, employs derivation through affixation as one of its word formation processes. The purpose of this article is to present a synchronic description of the way(s) different derivational affixes function in making complex words in Sarhaddi Balochi as spoken in Granchin district, located about 35 km to the southeast of Kha...
Statistical Analysis of Vietnamese Dialect Corpus and Dialect Identification Experiments
The performance of speech recognition systems improves if the corpus is organized for a specialized domain and applied consistently for speech recognition in specific situations. Vietnamese has many dialects. Building a corpus for Vietnamese dialects is the first step toward implementing a dialect identification system used for increasing the performance of Vietnames...
Preparation of MaDiTS corpus for Malay dialect translation and speech synthesis system
This paper presents our work in acquiring a Malay dialect translation and speech synthesis corpus. In this study, an architecture for speech corpus acquisition, which includes Malay dialect translation and Malay dialect grapheme-to-phoneme (G2P) conversion, was proposed. The pronunciation dictionary for dialectal Malay was generated using a G2P tool. As dialectal Malay is considered a scarce resource, di...
A Super Phonetic System and Multi-dialect Chinese Speech Corpus for Speech Recognition
In this paper, we describe our work on Chinese multi-dialect speech processing. Based on a phonetic analysis of ten Chinese dialects, we have created a Chinese super phonetic system for Chinese speech recognition. To examine this phonetic system and develop Chinese dialect speech technology, we are building a multi-dialect speech corpus, which includes 10 dialect areas and 2000 speakers.